Skip to content

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483

Merged
Baunsgaard merged 7 commits into
apache:mainfrom
Baunsgaard:split/mmChainTsmm
Jun 23, 2026
Merged

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483
Baunsgaard merged 7 commits into
apache:mainfrom
Baunsgaard:split/mmChainTsmm

Conversation

@Baunsgaard

@Baunsgaard Baunsgaard commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer.

CLALib matmul changes:

  • CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed
  • CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling
  • CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback
  • CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates

Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp):

  • Hop: new helpers hasSparkOutput() and isScalarOrVectorBelowBlockSize() shared between unary and binary decision points
  • UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark when the input already has a Spark output; gated on the ALLOW_TRANSITIVE_SPARK_EXEC_TYPE flag so it shares a kill-switch with the binary path
  • BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector

Instruction-side adjustments:

  • VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper
  • ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism()
  • DecoderComposite (parallel decode): parallelize over row blocks instead of over decoders, so the sub-decoders still run in order within each block (e.g. recode-on-output depends on the category indexes from the preceding dummycode decoder); fall back to the sequential path for k <= 1

Testing:

  • New CompilerTestBase harness for compile-time exec-type assertions, plus SparkTransitiveExecTypeCompileTest and an end-to-end SparkTransitiveExecTypeTest (with DML) covering the unary/binary transitive Spark pull, its multi-consumer guard, the cumulative-op exclusion, and the flag-off kill-switch
  • CLALibMMChainTest, CLALibRightMultBySDCTest, and DecoderCompositeTest for the compressed matmul fast paths, ASDC right-multiply, and parallel composite decode; shared helpers added to CompressedTestBase

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.65079% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.45%. Comparing base (e4f0987) to head (f4ead24).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/main/java/org/apache/sysds/hops/UnaryOp.java 72.72% 0 Missing and 3 partials ⚠️
src/main/java/org/apache/sysds/hops/BinaryOp.java 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2483      +/-   ##
============================================
- Coverage     71.47%   71.45%   -0.03%     
- Complexity    48883    48885       +2     
============================================
  Files          1573     1573              
  Lines        189238   189266      +28     
  Branches      37128    37137       +9     
============================================
- Hits         135261   135234      -27     
- Misses        43530    43572      +42     
- Partials      10447    10460      +13     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Baunsgaard Baunsgaard force-pushed the split/mmChainTsmm branch 2 times, most recently from 4a9df3c to 85916ba Compare June 16, 2026 12:18
Mixes two related performance changes: refined compressed multiply
heuristics, and a Spark-vs-CP decision refresh on the Hop layer.

CLALib matmul changes:
- CLALibMMChain: for XtXv with few col groups and a wide-enough
  matrix, compute X' * X via leftMultByTransposeSelf and finish with
  a regular matrix multiply against v. Cheaper than chaining when
  the X' * X path can stay compressed
- CLALibTSMM: refactor leftMultByTransposeSelf into a
  package-private helper so MMChain can call it; widen the
  ColGroupUncompressed handling
- CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero
  inputs; they have working preAggregate paths that beat the dense
  fallback
- CLALibCompAgg: fix blklen rounding so the last partition is not
  short by k rows on parallel aggregates

Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp):
- Hop: new helpers hasSparkOutput() and
  isScalarOrVectorBellowBlockSize() shared between unary and binary
  decision points
- UnaryOp.optFindExecType: replace the inline chain of negations
  with isDisallowedSparkOps(), allow Frame outputs, and pull unary
  ops into Spark whenever the input already has a Spark output
- BinaryOp.optFindExecType: same kind of restructuring; allow
  matrix-or-frame outputs to be pulled into Spark when exactly one
  operand is a scalar or small vector

Instruction-side adjustments:
- VariableCPInstruction (CAST_AS_MATRIX from frame): use the
  parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path
  instead of the single-threaded DataConverter helper
- ParameterizedBuiltinCPInstruction (transformdecode): call the
  parallel decoder.decode(data, out, k) overload using
  InfrastructureAnalyzer.getLocalParallelism()
The multi-threaded DecoderComposite.decode submitted one task per
decoder per row block, running all decoders concurrently. This broke
the ordering dependency between decoders: recode-on-output reads the
category indexes written by the dummycode decoder, so when the recode
task raced ahead it read unwritten cells and produced null or the raw
index instead of the original value.

Parallelize over row blocks instead, running all decoders in order
within each block via the sequential block decode. Also short-circuit
to the single-threaded path when k <= 1.

Fixes order-dependent failures in TransformFrameEncodeDecodeTest and
TransformFrameEncodeColmapTest (dummycode single-node/hybrid) that
surfaced once transformdecode started using the parallel decode path.
…isions

Re-enable the betterIfDecompressed gate in CLALibRightMultBy so the
decompressing right-multiply path stays reachable, while still excluding
ASDC/ASDCZero column groups from forced decompression.

Add targeted and end-to-end tests covering the recently tuned paths:
- CLALibMMChainTest: the public CLALibTSMM.leftMultByTransposeSelf overload
  (wide/narrow/uncompressed/empty/reuse/null) and the XtXv mm-chain fast
  path, including a tile-then-recompress wide-chain case.
- CLALibRightMultBySDCTest: right multiply on ASDC/ASDCZero inputs is not
  forced to decompress, single-threaded and parallel.
- DecoderCompositeTest: parallel and single-thread composite decode,
  exercising the dummycode+recode ordering dependency.
- SparkTransitiveExecTypeTest with DML scripts: UnaryOp/BinaryOp/Hop
  transitive Spark exec-type pulling under a constrained memory budget.
- CompressedTestBase: two parameterized e2e cases that validate the TSMM
  overload and the wide XtXv fast path against uncompressed results across
  all compression configurations.
…e test

Make the spark-specific decision refinement consistent between UnaryOp
and BinaryOp:
- UnaryOp: restore the input-is-not-checkpoint and single-parent guards,
  and drop the redundant `_etype != ExecType.SPARK` clause
- BinaryOp: use the shared hasSparkOutput() helper instead of an inline
  optFindExecType() == SPARK check

Add a compilation-verification test suite under component/compile that
compiles a DML script into a runtime program and inspects instruction
exec types without executing. CompilerTestBase provides the compile and
plan-inspection helpers; SparkTransitiveExecTypeCompileTest verifies a
CP-by-estimate unary on a Spark-resident input is pulled into Spark only
when it is the sole consumer.
…e ops

Extend SparkTransitiveExecTypeCompileTest to cover branches that were
previously untested in the exec-type refinement:

- BinaryOp transitive pull: a tall (multi-block) but byte-small Spark-resident
  column vector in a matrix-scalar op is pulled into Spark when it is the sole
  consumer, and stays CP when it has multiple consumers.
- UnaryOp cumulative exclusion: a cumulative op on a Spark-resident input stays
  CP because cumulative ops are excluded from the transitive pull.

These exercise Hop.isScalarOrVectorBellowBlockSize and hasSparkOutput across
both branches. Inputs are consumed via indexing to avoid the sum(cumsum)
algebraic rewrite that would otherwise eliminate the operator under test.
…nt nits

Align the unary transitive Spark exec-type refinement with BinaryOp and
clean up the new helpers introduced for the decision refresh:

- Gate the UnaryOp transitive pull on the transitive flag so
  ALLOW_TRANSITIVE_SPARK_EXEC_TYPE acts as a real kill-switch for both
  unary and binary pulls. Default is true, so default behavior is
  unchanged; disabling the flag now keeps a pullable unary op in CP.
- Rename Hop.isScalarOrVectorBellowBlockSize to isScalarOrVectorBelowBlockSize
  (spelling) and parenthesize the row/col clauses symmetrically.
- Drop the unused k parameter from the private CLALibTSMM.addCorrectionLayer
  overload.
- Simplify the redundant "v1 == true" in BinaryOp and fix a misleading
  comment on the disallowed-ops guard.
- Add flag-off compile tests proving the kill-switch keeps an otherwise
  pullable unary and binary op in CP.
Shorten redundant inline comments added with the exec-type refinement:
drop the parenthetical on the UnaryOp transitive guard, condense the
compile-test class javadoc, and remove the noisy back-references and
duplicated assert comments in the flag-off tests.
@Baunsgaard Baunsgaard merged commit d71d469 into apache:main Jun 23, 2026
50 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant